Malicious data


Safe LoRA: The Silver Lining of Reducing Safety Risks when Finetuning Large Language Models

Neural Information Processing Systems

While large language models (LLMs) such as Llama-2 or GPT-4 have shown impressive zero-shot performance, fine-tuning is still necessary to enhance their performance on customized datasets, domain-specific tasks, or other private needs. However, fine-tuning all parameters of an LLM requires significant hardware resources, which can be impractical for typical users. Therefore, parameter-efficient fine-tuning methods such as LoRA have emerged, allowing users to fine-tune LLMs without considerable computing resources and with little performance degradation compared to full-parameter fine-tuning. Unfortunately, recent studies indicate that fine-tuning can increase the safety risks of LLMs, even when the data does not contain malicious content. To address this challenge, we propose $\textsf{Safe LoRA}$, a simple one-liner patch to the original LoRA implementation that projects the LoRA weights of selected layers onto the safety-aligned subspace, effectively reducing the safety risks of LLM fine-tuning while maintaining utility. It is worth noting that $\textsf{Safe LoRA}$ is a training-free and data-free approach, as it only requires knowledge of the weights from the base and aligned LLMs. Our extensive experiments demonstrate that when fine-tuning on purely malicious data, $\textsf{Safe LoRA}$ retains safety performance similar to the original aligned model. Moreover, when the fine-tuning dataset contains a mixture of benign and malicious data, $\textsf{Safe LoRA}$ mitigates the negative effects of malicious data while preserving performance on downstream tasks. Our code is available at https://github.com/IBM/SafeLoRA.
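
As a rough illustration of the projection idea described above, the sketch below builds a per-layer projection matrix from the difference between the aligned and base weights and applies it to the LoRA update whenever that update drifts away from the aligned subspace. The shapes, the similarity threshold, and the exact form of the projector are assumptions made for illustration, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch of projecting a LoRA update onto a "safety-aligned" subspace.
# Assumes per-layer weight matrices from the base and aligned models are available;
# the exact projector and layer-selection rule used by Safe LoRA may differ.
import torch

def safety_projection(W_aligned: torch.Tensor, W_base: torch.Tensor) -> torch.Tensor:
    """Build a projection matrix from the aligned-minus-base weight difference."""
    V = W_aligned - W_base                      # alignment direction for this layer
    return (V @ V.T) / torch.linalg.norm(V)     # illustrative projector onto span(V)

def patch_lora_update(B: torch.Tensor, A: torch.Tensor, C: torch.Tensor,
                      threshold: float = 0.35) -> torch.Tensor:
    """Project the LoRA update dW = B @ A when it drifts from the aligned subspace."""
    dW = B @ A
    dW_proj = C @ dW
    similarity = torch.nn.functional.cosine_similarity(
        dW.flatten(), dW_proj.flatten(), dim=0)
    # Only patch layers whose update has moved away from the aligned subspace.
    return dW if similarity >= threshold else dW_proj
```

Because the projector depends only on the base and aligned checkpoints, a patch of this form can be applied after fine-tuning without any additional training or data, which is consistent with the training-free and data-free property claimed above.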


Investigating cybersecurity incidents using large language models in latest-generation wireless networks

Legashev, Leonid, Zhigalov, Arthur

arXiv.org Artificial Intelligence

Purpose of the research: detection of cybersecurity incidents, decision support, and assessment of the effectiveness of measures to counter information security threats based on modern generative models. Research methods: emulation of signal-propagation data in MIMO systems, synthesis of adversarial examples, execution of adversarial attacks on machine learning models, fine-tuning of large language models to detect adversarial attacks, and explanation of incident-detection decisions through prompting techniques. Scientific novelty: binary classification of data poisoning attacks was performed using large language models, and the feasibility of using large language models to investigate cybersecurity incidents in latest-generation wireless networks was examined. Results: large language models were fine-tuned on data prepared from an emulated wireless network segment. Six large language models were compared for detecting adversarial attacks, and the ability of a large language model to explain its decisions was investigated. The Gemma-7b model showed the best results, with Precision = 0.89, Recall = 0.89, and F1-Score = 0.89. Prompted for explanations, the Gemma-7b model identifies inconsistencies in the compromised data, performs feature-importance analysis, and offers recommendations for mitigating the consequences of adversarial attacks. Large language models integrated with binary classifiers of network threats show significant potential for practical application in cybersecurity incident investigation, decision support, and assessing the effectiveness of countermeasures against information security threats.
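
For readers who want a concrete starting point, the sketch below shows one way to set up the kind of pipeline described above: parameter-efficient fine-tuning of a large language model as a binary classifier of poisoned versus clean network records, scored with Precision, Recall, and F1. The checkpoint name, the dataset file, the column names, and the hyperparameters are placeholders; the paper's own preprocessing of the emulated MIMO data is not reproduced here.

```python
# Hedged sketch: LoRA fine-tuning of an LLM as a binary classifier of adversarial
# (poisoned) vs. clean network records, evaluated with Precision, Recall, F1.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model
from sklearn.metrics import precision_recall_fscore_support
from datasets import load_dataset

model_name = "google/gemma-7b"                    # assumption: any HF checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="SEQ_CLS"))

# Hypothetical CSV with a textual "record" column and an integer "label" column.
data = load_dataset("csv", data_files="wireless_segment.csv")["train"]
data = data.map(lambda ex: tokenizer(ex["record"], truncation=True), batched=True)
data = data.train_test_split(test_size=0.2)

def metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(-1)
    p, r, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    return {"precision": p, "recall": r, "f1": f1}

trainer = Trainer(model=model,
                  args=TrainingArguments("out", num_train_epochs=1,
                                         per_device_train_batch_size=1),
                  train_dataset=data["train"], eval_dataset=data["test"],
                  tokenizer=tokenizer, compute_metrics=metrics)
trainer.train()
print(trainer.evaluate())
```

Comparing several checkpoints then amounts to repeating the same loop with different `model_name` values and ranking the resulting F1 scores, which is the kind of comparison that led the authors to favour Gemma-7b.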


Reviews: Blind Attacks on Machine Learners

Neural Information Processing Systems

Overall, I find this branch of work quite interesting and am glad the authors are choosing to study this problem. The attacks described here may become feasible in the age of large web-scale datasets or human-in-the-loop training systems, as well as in the privacy scenario mentioned in the paper. The authors do an excellent job of motivating the problem. The paper appears to be clearly written, provided the reader is an expert in statistical decision theory. I must admit that this is not my area of expertise.


VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled Data

Du, Xuefeng, Ghosh, Reshmi, Sim, Robert, Salem, Ahmed, Carvalho, Vitor, Lawton, Emily, Li, Yixuan, Stokes, Jack W.

arXiv.org Artificial Intelligence

Vision-language models (VLMs) are essential for contextual understanding of both visual and textual information. However, their vulnerability to adversarially manipulated inputs presents significant risks, leading to compromised outputs and raising concerns about their reliability in VLM-integrated applications. Detecting these malicious prompts is thus crucial for maintaining trust in VLM generations. A major challenge in developing a safeguarding prompt classifier is the lack of a large amount of labeled benign and malicious data. To address the issue, we introduce VLMGuard, a novel learning framework that leverages unlabeled user prompts in the wild for malicious prompt detection. These unlabeled prompts, which naturally arise when VLMs are deployed in the open world, consist of both benign and malicious information. To harness the unlabeled data, we present an automated maliciousness estimation score for distinguishing between benign and malicious samples within this unlabeled mixture, thereby enabling the training of a binary prompt classifier on top. Notably, our framework does not require extra human annotations, offering strong flexibility and practicality for real-world applications. Extensive experiments show that VLMGuard achieves superior detection results, significantly outperforming state-of-the-art methods. Disclaimer: This paper may contain offensive examples; reader discretion is advised.
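
The sketch below illustrates the two-stage recipe described above, assuming prompt embeddings have already been extracted from the VLM: an automatic score separates the unlabeled mixture into pseudo-benign and pseudo-malicious subsets, and a binary classifier is then trained on top. The scoring rule shown (residual distance from the dominant embedding direction) and the quantile cut-off are illustrative stand-ins, not VLMGuard's actual maliciousness estimation score.

```python
# Hedged sketch: pseudo-labeling an unlabeled prompt mixture with an automatic
# score, then training a binary prompt classifier on the pseudo-labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

def maliciousness_score(embeddings: np.ndarray) -> np.ndarray:
    """Score each prompt by how far its embedding lies from the dominant direction."""
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Residual after removing the first principal direction of the mixture.
    residual = centered - (centered @ vt[:1].T) @ vt[:1]
    return np.linalg.norm(residual, axis=1)

def train_prompt_classifier(embeddings: np.ndarray, quantile: float = 0.9):
    """Pseudo-label the top-scoring prompts as malicious and fit a classifier."""
    scores = maliciousness_score(embeddings)
    pseudo_labels = (scores >= np.quantile(scores, quantile)).astype(int)
    return LogisticRegression(max_iter=1000).fit(embeddings, pseudo_labels)
```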


Adversarial Attacks on Machine Learning Cybersecurity Defences in Industrial Control Systems

Anthi, Eirini, Williams, Lowri, Rhode, Matilda, Burnap, Pete, Wedgbury, Adam

arXiv.org Machine Learning

The proliferation and application of machine learning based Intrusion Detection Systems (IDS) have allowed for more flexibility and efficiency in the automated detection of cyber attacks in Industrial Control Systems (ICS). However, the introduction of such IDSs has also created an additional attack vector: the learning models themselves may be subject to cyber attacks, otherwise referred to as Adversarial Machine Learning (AML). Such attacks may have severe consequences in ICS, as adversaries could potentially bypass the IDS. This could lead to delayed attack detection, which may result in infrastructure damage, financial loss, and even loss of life. This paper explores how adversarial learning can be used to target supervised models by generating adversarial samples with the Jacobian-based Saliency Map Attack and examining the resulting classification behaviours. The analysis also explores how such samples can improve the robustness of supervised models through adversarial training. An authentic power system dataset was used to support the experiments presented herein. Overall, the classification performance of two widely used classifiers, Random Forest and J48, decreased by 16 and 20 percentage points, respectively, when adversarial samples were present. Their performance improved following adversarial training, demonstrating their robustness towards such attacks.
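
The sketch below reproduces the evaluate-then-harden loop described above on synthetic data: an IDS classifier is attacked with saliency-guided perturbations and then retrained on the adversarial samples. A linear surrogate stands in for the Jacobian-based Saliency Map Attack used in the paper, and the dataset, feature count, and perturbation budget are placeholders rather than the authentic power system setup.

```python
# Hedged sketch: measure the accuracy drop of a Random Forest IDS under
# saliency-guided perturbations, then harden it with adversarial training.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

ids = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
surrogate = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # differentiable stand-in

def saliency_perturb(X, y, k=3, eps=1.0):
    """Push each sample's k most influential features against its true class."""
    w = surrogate.coef_[0]
    top = np.argsort(np.abs(w))[-k:]                       # most "salient" features
    X_adv = X.copy()
    X_adv[:, top] -= eps * np.sign(w[top]) * np.where(y == 1, 1, -1)[:, None]
    return X_adv

X_adv = saliency_perturb(X_te, y_te)
print("clean accuracy:      ", ids.score(X_te, y_te))
print("adversarial accuracy:", ids.score(X_adv, y_te))

# Adversarial training: refit the IDS on clean plus adversarial samples.
hardened = RandomForestClassifier(n_estimators=100, random_state=0).fit(
    np.vstack([X_tr, saliency_perturb(X_tr, y_tr)]), np.hstack([y_tr, y_tr]))
print("hardened accuracy:   ", hardened.score(X_adv, y_te))
```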


Deep learning and machine learning to transform cybersecurity

#artificialintelligence

Cybersecurity specialists have been betting on artificial intelligence (AI) to defend their organizations against sophisticated cyberattacks for quite a while now, and it seems as though deep learning and machine learning have the potential to deliver. AI is a broad term that encompasses computer vision, machine learning, and deep learning, and generally offers the ability to mimic human actions, intelligently and at incredible speed. For hackers trying to "guess" a password, it means AI can not only use trial and error to break into a victim's account much faster but also do it intelligently, so that the account doesn't get locked before the right password is guessed. On the other side of the fence, or network, cybersecurity professionals didn't immediately benefit from AI because the systems in place don't automatically lend themselves to the technology; however, experts bet on two niche elements of AI to find a solution. Those niche areas are machine learning and deep learning. Machine learning, simply put, is an algorithm that learns from a chunk of structured, labeled data to produce insights.


"Did I Say Something Wrong?": Towards a Safe Collaborative Chatbot

Chkroun, Merav (Ariel University) | Azaria, Amos (Ariel University)

AAAI Conferences

Chatbots have been a core measure of AI since Turing presented his test for intelligence, and they are also widely used for entertainment purposes. In this paper we present a platform that enables users to collaboratively teach a chatbot responses using natural language. We present a method for collectively detecting malicious users and for using the commands taught by these users to further mitigate the activity of future malicious users.
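
As a rough illustration of the collaborative-safety idea sketched in the abstract, the snippet below keeps a per-user trust score driven by peer feedback and reuses content taught by flagged users as a filter against future teachings. The thresholds and the phrase-matching test are illustrative assumptions, not the detection mechanism from the paper.

```python
# Hedged sketch: serve taught responses only from trusted users, and reuse the
# teachings of flagged users to screen what future users try to teach.
from collections import defaultdict

TRUST_THRESHOLD = 0.0
trust = defaultdict(float)            # per-user trust score from peer feedback
flagged_phrases: set[str] = set()     # content taught by users found malicious

def record_feedback(teacher: str, approved: bool) -> None:
    """Other users up- or down-vote a response taught by `teacher`."""
    trust[teacher] += 1.0 if approved else -1.0

def accept_teaching(teacher: str, response: str) -> bool:
    """Serve a taught response only if the teacher is trusted and it is unflagged."""
    if trust[teacher] < TRUST_THRESHOLD:
        return False
    return not any(phrase in response.lower() for phrase in flagged_phrases)

def flag_user(teacher: str, taught_responses: list[str]) -> None:
    """Once a user is deemed malicious, reuse their teachings as filters."""
    trust[teacher] = float("-inf")
    flagged_phrases.update(r.lower() for r in taught_responses)
```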